12-cs01-eda

Professor Shannon Ellis

2/21/23

CS01: Right-To-Carry (EDA)

Q&A

Q: I’m curious about how we would collaborate for the case study as a group with a single file - would it be a bunch of pushes and pulls or is it a little more complicated than that?
A: As long as you’re working on separate lines/parts of the file, you can all push and pull to the same file! If tasks are well delineated and you always remember to pull before you get started (and nobody pushes while you’re working on your part), there won’t be any issues. However, if you’re all editing the same parts of the file, or pushing and pulling at the same time, you will run into merge conflicts. These can certainly be handled via git/GitHub, but they make things a tad more complicated. For those who are less comfortable using GitHub, some groups choose to work in separate .Rmd files, push those to the group repo, and then combine them all at the end! This is very much something for your group to discuss!

Course Announcements

Due Dates:

  • Lecture Participation survey “due” after class
  • Lab06 due Friday (2/24; 11:59 PM)
  • HW03 due Mon (2/27; 11:59 PM)

Notes:

  • Final Project Groups survey (link also on Canvas; “due” Friday)
    • if you’re in a group: one reply per group
    • if you need a group: one reply per individual
  • Lab05 Answers posted
  • Mid-course survey credit posted
  • Midterm grades to be finalized/posted tomorrow

Agenda

  • Mid-course survey summary
  • See/Discuss some EDA
  • Brainstorm some EDA
  • Do some EDA

Mid-Course Survey

Time Spent

Difficulty

What would you change?

Questions

  1. What is the relationship between right to carry laws and violence rates in the US?
  2. What is the effect of multicollinearity on coefficient estimates from linear regression models when analyzing right to carry laws and violence rates?
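
To make question 2 concrete, here is a minimal simulated sketch (not the case study data) of what multicollinearity can do to linear regression estimates. The variables x1, x2_indep, x2_collin, and the two outcomes are made up for illustration. The only point is that when two predictors are nearly copies of each other, the standard error of each coefficient grows, so the estimates become much less stable.

```r
# Simulated illustration only: none of these variables come from the RTC data.
set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2_indep  <- rnorm(n)                   # unrelated to x1
x2_collin <- x1 + rnorm(n, sd = 0.05)   # almost a copy of x1

# Same true relationship in both cases: y depends on both predictors.
y_indep  <- 1 + 2 * x1 + 2 * x2_indep  + rnorm(n)
y_collin <- 1 + 2 * x1 + 2 * x2_collin + rnorm(n)

fit_indep  <- lm(y_indep  ~ x1 + x2_indep)
fit_collin <- lm(y_collin ~ x1 + x2_collin)

# Standard error of the x1 coefficient in each model
se_indep  <- summary(fit_indep)$coefficients["x1", "Std. Error"]
se_collin <- summary(fit_collin)$coefficients["x1", "Std. Error"]

print(c(independent = se_indep, collinear = se_collin))
print(cor(x1, x2_collin))
```

The coefficient on x1 has the same true value in both models, but its standard error is far larger in the collinear fit.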

Packages & Data

library(tidyverse)
library(skimr) # will need to install first
library(ggrepel) # will need to install first

This will only work if you finished the last set of notes…

load("data/wrangled/wrangled_data_rtc.rda")
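
If the file is missing, load() stops with an error, so a small guard can give a clearer message. Below is a minimal sketch of that pattern, demonstrated on a temporary file so it runs anywhere. The helper name load_if_present and the toy object demo_df are hypothetical; in these notes the path would be data/wrangled/wrangled_data_rtc.rda.

```r
# Hypothetical helper: load an .rda file only if it exists and
# report which objects it restored.
load_if_present <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path,
         " (was the previous set of notes run to the end?)")
  }
  # load() returns the names of the objects it restored
  loaded <- load(path, envir = globalenv())
  loaded
}

# Demonstration with a throwaway file rather than the real data
demo_df  <- data.frame(YEAR = 1980:1982)
tmp_path <- tempfile(fileext = ".rda")
save(demo_df, file = tmp_path)
rm(demo_df)

restored <- load_if_present(tmp_path)
print(restored)
```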

LOTT data

glimpse(LOTT_DF)
Rows: 1,364
Columns: 50
$ YEAR                           <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 198…
$ STATE                          <chr> "Alaska", "Arizona", "Arkansas", "Calif…
$ Black_Female_10_to_19_years    <dbl> 0.26391223, 0.28748026, 1.81933049, 0.7…
$ Black_Female_20_to_29_years    <dbl> 0.44331324, 0.27753816, 1.50296508, 0.8…
$ Black_Female_30_to_39_years    <dbl> 0.201146585, 0.165433651, 0.842359498, …
$ Black_Female_40_to_49_years    <dbl> 0.115646931, 0.119305223, 0.633866784, …
$ Black_Female_50_to_64_years    <dbl> 0.092418701, 0.136484590, 1.015244173, …
$ Black_Female_65_years_and_over <dbl> 0.026440644, 0.103332066, 1.156103458, …
$ Black_Male_10_to_19_years      <dbl> 0.29677770, 0.31145827, 1.81159721, 0.8…
$ Black_Male_20_to_29_years      <dbl> 0.69462291, 0.33792181, 1.26270912, 0.8…
$ Black_Male_30_to_39_years      <dbl> 0.29875457, 0.18879028, 0.71111220, 0.5…
$ Black_Male_40_to_49_years      <dbl> 0.147771078, 0.127310077, 0.476448668, …
$ Black_Male_50_to_64_years      <dbl> 0.102797272, 0.130636295, 0.741127809, …
$ Black_Male_65_years_and_over   <dbl> 0.027181971, 0.085421662, 0.870583784, …
$ Other_Female_10_to_19_years    <dbl> 2.04383711, 0.80253231, 0.06531781, 0.5…
$ Other_Female_20_to_29_years    <dbl> 1.76559257, 0.65515527, 0.07942996, 0.6…
$ Other_Female_30_to_39_years    <dbl> 1.24839379, 0.44180215, 0.06702176, 0.6…
$ Other_Female_40_to_49_years    <dbl> 0.79124246, 0.31098310, 0.04216167, 0.3…
$ Other_Female_50_to_64_years    <dbl> 0.74651577, 0.28875958, 0.04390930, 0.4…
$ Other_Female_65_years_and_over <dbl> 0.37906494, 0.16250950, 0.03158848, 0.2…
$ Other_Male_10_to_19_years      <dbl> 2.15157655, 0.81174338, 0.07034226, 0.5…
$ Other_Male_20_to_29_years      <dbl> 1.76361570, 0.59561232, 0.07497349, 0.6…
$ Other_Male_30_to_39_years      <dbl> 1.19971335, 0.38931370, 0.04928327, 0.5…
$ Other_Male_40_to_49_years      <dbl> 0.79519620, 0.25710568, 0.03552066, 0.3…
$ Other_Male_50_to_64_years      <dbl> 0.74058515, 0.23513802, 0.03281182, 0.3…
$ Other_Male_65_years_and_over   <dbl> 0.393397252, 0.150630154, 0.019486117, …
$ White_Female_10_to_19_years    <dbl> 6.121874, 7.373713, 6.669014, 6.720429,…
$ White_Female_20_to_29_years    <dbl> 8.608777, 8.195326, 6.657261, 7.997032,…
$ White_Female_30_to_39_years    <dbl> 7.054710, 6.259248, 5.710656, 6.373367,…
$ White_Female_40_to_49_years    <dbl> 3.749629, 4.414842, 4.319801, 4.342865,…
$ White_Female_50_to_64_years    <dbl> 3.352525, 7.079325, 6.767843, 6.587129,…
$ White_Female_65_years_and_over <dbl> 1.048977, 6.082958, 6.700472, 5.556054,…
$ White_Male_10_to_19_years      <dbl> 6.873085, 7.641858, 6.993288, 7.029783,…
$ White_Male_20_to_29_years      <dbl> 9.804784, 8.406997, 6.564418, 8.471549,…
$ White_Male_30_to_39_years      <dbl> 8.483740, 6.285382, 5.560709, 6.519398,…
$ White_Male_40_to_49_years      <dbl> 4.666650, 4.336730, 4.170641, 4.353268,…
$ White_Male_50_to_64_years      <dbl> 4.103242, 6.210707, 5.993248, 6.065005,…
$ White_Male_65_years_and_over   <dbl> 1.020807, 4.797064, 4.924526, 3.754192,…
$ Unemployment_rate              <dbl> 9.6, 6.6, 7.6, 6.8, 5.8, 7.6, 7.4, 6.1,…
$ Poverty_rate                   <dbl> 9.6, 12.8, 21.5, 11.0, 8.6, 11.8, 20.9,…
$ Viol_crime_count               <dbl> 1919, 17673, 7656, 210290, 15215, 2824,…
$ Population                     <dbl> 404680, 2735840, 2288809, 23792840, 290…
$ police_per_100k_lag            <dbl> 194.72176, 262.66156, 152.00045, 243.92…
$ RTC_LAW_YEAR                   <dbl> 1995, 1995, 1996, Inf, 2003, Inf, Inf, …
$ RTC_LAW                        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ TIME_0                         <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 198…
$ TIME_INF                       <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 201…
$ Viol_crime_rate_1k             <dbl> 4.742018, 6.459808, 3.344971, 8.838373,…
$ Viol_crime_rate_1k_log         <dbl> 1.5564629, 1.8655995, 1.2074581, 2.1791…
$ Population_log                 <dbl> 12.91085, 14.82195, 14.64354, 16.98490,…

skimr::skim(LOTT_DF)
Data summary
Name LOTT_DF
Number of rows 1364
Number of columns 50
_______________________
Column type frequency:
character 1
logical 1
numeric 48
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
STATE 0 1 4 20 0 44 0

Variable type: logical

skim_variable n_missing complete_rate mean count
RTC_LAW 0 1 0.36 FAL: 868, TRU: 496

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
YEAR 0 1 1995.00 8.95 1980.00 1987.00 1995.00 2003.00 2010.00 ▇▇▇▇▇
Black_Female_10_to_19_years 0 1 1.02 1.02 0.02 0.26 0.64 1.44 6.53 ▇▂▁▁▁
Black_Female_20_to_29_years 0 1 1.01 1.09 0.02 0.26 0.61 1.37 7.73 ▇▂▁▁▁
Black_Female_30_to_39_years 0 1 0.93 1.00 0.01 0.21 0.58 1.29 6.11 ▇▂▁▁▁
Black_Female_40_to_49_years 0 1 0.76 0.87 0.01 0.14 0.49 1.10 5.45 ▇▂▁▁▁
Black_Female_50_to_64_years 0 1 0.78 0.97 0.00 0.14 0.45 1.08 6.10 ▇▂▁▁▁
Black_Female_65_years_and_over 0 1 0.62 0.86 0.00 0.08 0.35 0.82 6.12 ▇▁▁▁▁
Black_Male_10_to_19_years 0 1 1.04 1.02 0.03 0.29 0.68 1.47 6.32 ▇▃▁▁▁
Black_Male_20_to_29_years 0 1 0.95 0.93 0.04 0.31 0.66 1.25 6.57 ▇▂▁▁▁
Black_Male_30_to_39_years 0 1 0.82 0.84 0.02 0.24 0.55 1.10 5.37 ▇▂▁▁▁
Black_Male_40_to_49_years 0 1 0.66 0.72 0.01 0.16 0.44 0.93 4.45 ▇▂▁▁▁
Black_Male_50_to_64_years 0 1 0.64 0.76 0.00 0.14 0.40 0.87 4.79 ▇▂▁▁▁
Black_Male_65_years_and_over 0 1 0.39 0.51 0.00 0.06 0.24 0.52 3.56 ▇▁▁▁▁
Other_Female_10_to_19_years 0 1 0.51 0.78 0.03 0.15 0.27 0.56 5.33 ▇▁▁▁▁
Other_Female_20_to_29_years 0 1 0.49 0.71 0.04 0.17 0.30 0.56 5.55 ▇▁▁▁▁
Other_Female_30_to_39_years 0 1 0.48 0.75 0.04 0.15 0.28 0.52 5.36 ▇▁▁▁▁
Other_Female_40_to_49_years 0 1 0.39 0.70 0.02 0.11 0.21 0.38 5.46 ▇▁▁▁▁
Other_Female_50_to_64_years 0 1 0.38 0.84 0.02 0.09 0.18 0.35 7.10 ▇▁▁▁▁
Other_Female_65_years_and_over 0 1 0.25 0.72 0.01 0.04 0.09 0.18 6.20 ▇▁▁▁▁
Other_Male_10_to_19_years 0 1 0.53 0.81 0.03 0.15 0.28 0.58 5.58 ▇▁▁▁▁
Other_Male_20_to_29_years 0 1 0.48 0.71 0.03 0.16 0.29 0.54 5.33 ▇▁▁▁▁
Other_Male_30_to_39_years 0 1 0.44 0.71 0.03 0.14 0.26 0.48 5.06 ▇▁▁▁▁
Other_Male_40_to_49_years 0 1 0.35 0.66 0.02 0.09 0.19 0.34 5.13 ▇▁▁▁▁
Other_Male_50_to_64_years 0 1 0.33 0.74 0.01 0.08 0.16 0.30 6.50 ▇▁▁▁▁
Other_Male_65_years_and_over 0 1 0.19 0.59 0.01 0.03 0.07 0.14 4.51 ▇▁▁▁▁
White_Female_10_to_19_years 0 1 5.69 1.37 0.94 4.96 5.79 6.57 9.45 ▁▁▇▆▁
White_Female_20_to_29_years 0 1 6.07 1.36 1.59 5.23 5.90 6.93 9.65 ▁▂▇▅▂
White_Female_30_to_39_years 0 1 6.15 1.22 1.53 5.45 6.28 7.00 8.95 ▁▁▅▇▂
White_Female_40_to_49_years 0 1 5.56 1.22 1.20 4.84 5.66 6.39 8.33 ▁▁▇▇▂
White_Female_50_to_64_years 0 1 6.55 1.45 1.72 6.00 6.57 7.32 11.40 ▁▂▇▂▁
White_Female_65_years_and_over 0 1 6.40 1.71 1.05 5.37 6.67 7.54 9.90 ▁▁▆▇▂
White_Male_10_to_19_years 0 1 6.00 1.42 1.02 5.26 6.11 6.91 9.74 ▁▁▇▇▁
White_Male_20_to_29_years 0 1 6.26 1.32 2.41 5.42 6.10 7.13 9.96 ▁▃▇▃▁
White_Male_30_to_39_years 0 1 6.25 1.18 1.93 5.57 6.31 7.04 9.67 ▁▂▇▆▁
White_Male_40_to_49_years 0 1 5.56 1.21 1.35 4.77 5.66 6.40 8.24 ▁▁▇▇▃
White_Male_50_to_64_years 0 1 6.23 1.39 1.78 5.62 6.16 6.92 10.93 ▁▂▇▂▁
White_Male_65_years_and_over 0 1 4.56 1.19 1.02 3.80 4.78 5.34 7.51 ▁▂▇▇▁
Unemployment_rate 0 1 6.04 2.11 2.30 4.50 5.60 7.20 17.80 ▇▇▂▁▁
Poverty_rate 0 1 13.39 3.86 5.70 10.40 12.80 15.60 27.20 ▃▇▅▂▁
Viol_crime_count 0 1 32452.11 46790.78 322.00 5598.75 14684.00 39119.00 345624.00 ▇▁▁▁▁
Population 0 1 5559352.78 6092703.87 404680.00 1570224.75 3659637.00 6487139.00 37349363.00 ▇▂▁▁▁
police_per_100k_lag 0 1 315.19 116.43 83.76 247.63 298.45 354.02 1021.14 ▆▇▁▁▁
RTC_LAW_YEAR 0 1 Inf NaN 1985.00 1994.25 1997.00 2011.25 Inf ▇▇▃▅▂
TIME_0 0 1 1980.00 0.00 1980.00 1980.00 1980.00 1980.00 1980.00 ▁▁▇▁▁
TIME_INF 0 1 2010.00 0.00 2010.00 2010.00 2010.00 2010.00 2010.00 ▁▁▇▁▁
Viol_crime_rate_1k 0 1 5.10 3.21 0.48 2.87 4.63 6.47 29.30 ▇▃▁▁▁
Viol_crime_rate_1k_log 0 1 1.46 0.60 -0.74 1.05 1.53 1.87 3.38 ▁▂▇▅▁
Population_log 0 1 15.04 1.02 12.91 14.27 15.11 15.69 17.44 ▃▅▇▅▂
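
The RTC_LAW_YEAR row in the summary above reports a mean of Inf and an sd of NaN. That follows directly from the Inf values in the column, which presumably mark states with no RTC law during the study window (an assumption; the coding is not spelled out here). A minimal sketch using the first seven values shown in the glimpse() output:

```r
# First seven RTC_LAW_YEAR values shown in the glimpse() output above
rtc_year <- c(1995, 1995, 1996, Inf, 2003, Inf, Inf)

print(mean(rtc_year))  # infinite: any Inf in the input makes the mean infinite
print(sd(rtc_year))    # not a number: the deviations involve Inf - Inf

# Summaries restricted to the finite years
finite_year <- rtc_year[is.finite(rtc_year)]
mean_finite <- mean(finite_year)
n_never     <- sum(is.infinite(rtc_year))

print(mean_finite)
print(n_never)
```

Filtering with is.finite() before summarizing is one option; counting the Inf values separately keeps track of how many states never adopted a law under that assumed coding.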

DONOHUE data

glimpse(DONOHUE_DF)
Rows: 1,364
Columns: 20
$ YEAR                      <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 1980, 19…
$ STATE                     <chr> "Alaska", "Arizona", "Arkansas", "California…
$ Black_Male_15_to_19_years <dbl> 0.16704557, 0.17475437, 0.95451390, 0.433886…
$ Black_Male_20_to_39_years <dbl> 0.99337748, 0.52671209, 1.97382132, 1.353260…
$ Other_Male_15_to_19_years <dbl> 1.12978156, 0.41504620, 0.03849163, 0.312308…
$ Other_Male_20_to_39_years <dbl> 2.96332905, 0.98492602, 0.12425676, 1.213007…
$ White_Male_15_to_19_years <dbl> 3.6278047, 4.0915770, 3.7401985, 3.8358473, …
$ White_Male_20_to_39_years <dbl> 18.288524, 14.692380, 12.125127, 14.990947, …
$ Unemployment_rate         <dbl> 9.6, 6.6, 7.6, 6.8, 5.8, 7.6, 7.4, 6.1, 6.3,…
$ Poverty_rate              <dbl> 9.6, 12.8, 21.5, 11.0, 8.6, 11.8, 20.9, 16.7…
$ Viol_crime_count          <dbl> 1919, 17673, 7656, 210290, 15215, 2824, 1277…
$ Population                <dbl> 404680, 2735840, 2288809, 23792840, 2909545,…
$ police_per_100k_lag       <dbl> 194.72176, 262.66156, 152.00045, 243.92632, …
$ RTC_LAW_YEAR              <dbl> 1995, 1995, 1996, Inf, 2003, Inf, Inf, 1988,…
$ RTC_LAW                   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ TIME_0                    <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 1980, 19…
$ TIME_INF                  <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 20…
$ Viol_crime_rate_1k        <dbl> 4.742018, 6.459808, 3.344971, 8.838373, 5.22…
$ Viol_crime_rate_1k_log    <dbl> 1.5564629, 1.8655995, 1.2074581, 2.1791028, …
$ Population_log            <dbl> 12.91085, 14.82195, 14.64354, 16.98490, 14.8…

skimr::skim(DONOHUE_DF)
Data summary
Name DONOHUE_DF
Number of rows 1364
Number of columns 20
_______________________
Column type frequency:
character 1
logical 1
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
STATE 0 1 4 20 0 44 0

Variable type: logical

skim_variable n_missing complete_rate mean count
RTC_LAW 0 1 0.36 FAL: 868, TRU: 496

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
YEAR 0 1 1995.00 8.95 1980.00 1987.00 1995.00 2003.00 2010.00 ▇▇▇▇▇
Black_Male_15_to_19_years 0 1 0.53 0.51 0.02 0.15 0.36 0.74 3.46 ▇▂▁▁▁
Black_Male_20_to_39_years 0 1 1.77 1.76 0.07 0.57 1.19 2.32 11.33 ▇▂▁▁▁
Other_Male_15_to_19_years 0 1 0.26 0.40 0.01 0.08 0.14 0.29 2.90 ▇▁▁▁▁
Other_Male_20_to_39_years 0 1 0.93 1.42 0.07 0.31 0.55 1.01 9.90 ▇▁▁▁▁
White_Male_15_to_19_years 0 1 3.07 0.72 0.55 2.67 3.13 3.52 4.99 ▁▁▇▇▁
White_Male_20_to_39_years 0 1 12.51 2.28 4.41 11.13 12.61 14.13 18.29 ▁▂▇▇▂
Unemployment_rate 0 1 6.04 2.11 2.30 4.50 5.60 7.20 17.80 ▇▇▂▁▁
Poverty_rate 0 1 13.39 3.86 5.70 10.40 12.80 15.60 27.20 ▃▇▅▂▁
Viol_crime_count 0 1 32452.11 46790.78 322.00 5598.75 14684.00 39119.00 345624.00 ▇▁▁▁▁
Population 0 1 5559352.78 6092703.87 404680.00 1570224.75 3659637.00 6487139.00 37349363.00 ▇▂▁▁▁
police_per_100k_lag 0 1 315.19 116.43 83.76 247.63 298.45 354.02 1021.14 ▆▇▁▁▁
RTC_LAW_YEAR 0 1 Inf NaN 1985.00 1994.25 1997.00 2011.25 Inf ▇▇▃▅▂
TIME_0 0 1 1980.00 0.00 1980.00 1980.00 1980.00 1980.00 1980.00 ▁▁▇▁▁
TIME_INF 0 1 2010.00 0.00 2010.00 2010.00 2010.00 2010.00 2010.00 ▁▁▇▁▁
Viol_crime_rate_1k 0 1 5.10 3.21 0.48 2.87 4.63 6.47 29.30 ▇▃▁▁▁
Viol_crime_rate_1k_log 0 1 1.46 0.60 -0.74 1.05 1.53 1.87 3.38 ▁▂▇▅▁
Population_log 0 1 15.04 1.02 12.91 14.27 15.11 15.69 17.44 ▃▅▇▅▂
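
The two data frames have the same 1,364 rows but different demographic columns (50 columns versus 20). A minimal sketch of one way to compare them is below. The helper compare_columns is hypothetical, and the tiny stand-in data frames use only a few real column names with the Alaska 1980 values from the glimpse() output; with the real data loaded, the call would be compare_columns(LOTT_DF, DONOHUE_DF).

```r
# Hypothetical helper: which columns are shared, and which are unique to each?
compare_columns <- function(a, b) {
  list(
    shared = intersect(names(a), names(b)),
    only_a = setdiff(names(a), names(b)),
    only_b = setdiff(names(b), names(a))
  )
}

# Tiny stand-ins built from the first row of each glimpse() output;
# the real data frames have 50 and 20 columns.
lott_small <- data.frame(
  YEAR = 1980, STATE = "Alaska",
  White_Male_20_to_29_years = 9.804784,
  White_Male_30_to_39_years = 8.483740
)
donohue_small <- data.frame(
  YEAR = 1980, STATE = "Alaska",
  White_Male_20_to_39_years = 18.288524
)

cols <- compare_columns(lott_small, donohue_small)
print(cols)

# For this row, the two narrower LOTT age bands add up to the wider DONOHUE band
band_sum <- lott_small$White_Male_20_to_29_years +
  lott_small$White_Male_30_to_39_years
print(band_sum)
```

This suggests the DONOHUE age bands are coarser groupings of the same underlying percentages, which is worth confirming on the full data before relying on it.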

Population over time

DONOHUE_DF |>
  group_by(YEAR) |>
  summarise(Population = sum(Population)) |>
  ggplot(aes(x = YEAR, y = Population)) +
  geom_line() +
  scale_x_continuous(
    breaks = seq(1980, 2010, by = 1),
    limits = c(1980, 2010),
    labels = c(seq(1980, 2010, by = 1))
  ) +
  labs(
    title = "Population has steadily increased",
    x = "Year",
    y = "Population"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90),
        plot.title.position = "plot")

Crime over time

df <- DONOHUE_DF |>
  group_by(YEAR) |>
  summarize(Viol_crime_count = sum(Viol_crime_count),
            Population = sum(Population),
            .groups = "drop") |>
  mutate(Viol_crime_rate_100k_log = log((Viol_crime_count * 100000) / Population))
df |>
  ggplot(aes(x = YEAR, y = Viol_crime_rate_100k_log)) +
  geom_line() +
  scale_x_continuous(
    breaks = seq(1980, 2010, by = 1),
    limits = c(1980, 2010),
    labels = c(seq(1980, 2010, by = 1))
  ) +
  scale_y_continuous(
    breaks = seq(5.75, 6.75, by = 0.25),
    limits = c(5.75, 6.75)
  ) +
  labs(
    title = "Crime rates fluctuate over time",
    x = "Year",
    y = "ln(violent crimes per 100,000 people)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90), 
        plot.title.position = "plot")
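
The y-axis here is the natural log of violent crimes per 100,000 people, while the stored columns Viol_crime_rate_1k and Viol_crime_rate_1k_log are per 1,000. A minimal worked example with the Alaska 1980 row from the glimpse() output shows how the two scales relate:

```r
# Alaska, 1980 (first row of the glimpse() output)
viol_crime_count <- 1919
population       <- 404680

rate_1k   <- viol_crime_count * 1000   / population  # matches Viol_crime_rate_1k
rate_100k <- viol_crime_count * 100000 / population  # scale used in the plot

log_rate_1k   <- log(rate_1k)    # natural log, matches Viol_crime_rate_1k_log
log_rate_100k <- log(rate_100k)

print(c(rate_1k = rate_1k, log_rate_1k = log_rate_1k))
print(c(rate_100k = rate_100k, log_rate_100k = log_rate_100k))

# Changing the scale from per 1,000 to per 100,000 shifts the log by log(100)
print(log_rate_100k - log_rate_1k)
```

Because the change of scale only adds a constant on the log scale, the shape of the trend is the same whichever rate is plotted.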

Crime over time by state

# Compute the log rate once so the plot and its labels share the same data
state_crime <- DONOHUE_DF |>
  mutate(Viol_crime_rate_100k_log = log((Viol_crime_count * 100000) / Population))

p <- state_crime |>
  ggplot(aes(x = YEAR, y = Viol_crime_rate_100k_log, color = STATE)) +
  geom_point(size = 0.5) +
  geom_line(aes(group = STATE),
    linewidth = 0.5, # `linewidth` replaces `size` for lines in ggplot2 >= 3.4.0
    show.legend = FALSE
  ) +
  # Label each state once, at its final year
  geom_text_repel(
    data = state_crime |> filter(YEAR == max(YEAR)),
    aes(label = STATE, x = YEAR, y = Viol_crime_rate_100k_log),
    size = 3, alpha = 1, nudge_x = 1, direction = "y",
    hjust = 1, vjust = 1, segment.size = 0.25, segment.alpha = 0.25,
    force = 1, max.iter = 9999
  )
p + 
  guides(color = "none") +
  scale_x_continuous(
    breaks = seq(1980, 2015, by = 1),
    limits = c(1980, 2015),
    labels = c(seq(1980, 2010, by = 1), rep("", 5))
  ) +
  scale_y_continuous(
    breaks = seq(3.5, 8.5, by = 0.5),
    limits = c(3.5, 8.5)
  ) +
  labs(
    title = "States have different levels of crime",
    x = "Year", y = "ln(violent crimes per 100,000 people)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90), plot.title.position = "plot")

Police Presence over time

DONOHUE_DF |>
  group_by(YEAR) |>
  # Average the state-level rates; a sum of per-100K rates across states
  # is not itself a per-100K rate
  summarise(Police = mean(police_per_100k_lag)) |>
  ggplot(aes(x = YEAR, y = Police)) +
  geom_line() +
  scale_x_continuous(
    breaks = seq(1980, 2010, by = 1),
    limits = c(1980, 2010),
    labels = c(seq(1980, 2010, by = 1))
  ) +
  labs(
    title = "Police Presence has increased over time with fluctuations",
    x = "Year",
    y = "Mean police presence per 100K people (across states)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90),
        plot.title.position = "plot")
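
Because police_per_100k_lag is already a per-capita rate, the way it is aggregated across states changes how the y-axis should be read. A minimal sketch using the first four 1980 rows from the glimpse() output compares three options. The population-weighted version is an alternative offered for comparison, not something used in these notes.

```r
# First four 1980 rows from the glimpse() output above
states     <- c("Alaska", "Arizona", "Arkansas", "California")
population <- c(404680, 2735840, 2288809, 23792840)
police     <- c(194.72176, 262.66156, 152.00045, 243.92632)  # police_per_100k_lag

sum_of_rates  <- sum(police)                        # grows with the number of states
simple_mean   <- mean(police)                       # each state counts equally
weighted_mean <- weighted.mean(police, population)  # each person counts equally

print(c(sum = sum_of_rates, mean = simple_mean, weighted = weighted_mean))
```

With the same set of states every year, the sum and the simple mean trace the same shape over time and differ only in scale, but only the two means can be read as a rate per 100K people.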

Your Turn

🧠 Consider the data we’re working with and our questions of interest. What would you like to know that you don’t know yet?

❗ Do some EDA! Try to learn something from the data that we haven’t yet discussed. (Summarize data, make a plot, make a table, etc.)
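
As one possible direction (an illustration, not the expected answer), you could ask how many states had an RTC law in effect in each year. The sketch below uses only the first four states and RTC_LAW_YEAR values visible in the glimpse() output, and it assumes a law counts as in effect from its adoption year onward. That rule is an assumption; check it against how RTC_LAW was built in the wrangling notes.

```r
# First four states and their RTC_LAW_YEAR values from the glimpse() output.
# Inf marks a state with no adoption year recorded.
rtc <- data.frame(
  STATE        = c("Alaska", "Arizona", "Arkansas", "California"),
  RTC_LAW_YEAR = c(1995, 1995, 1996, Inf)
)

years <- 1980:2010

# ASSUMPTION: a law is in effect when YEAR >= RTC_LAW_YEAR
n_rtc_states <- sapply(years, function(y) sum(y >= rtc$RTC_LAW_YEAR))
names(n_rtc_states) <- years

print(n_rtc_states[c("1994", "1995", "1996", "2010")])
```

With the full data, the same count could be computed from the RTC_LAW column grouped by YEAR and then plotted as a line.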

Where to go from here?

  • Implement some of these ideas
  • This week’s lab continues the EDA!
    • Consider what variables we have that we haven’t looked at
    • Consider the variables we have looked at but look at them differently
  • Eventually: incorporate some of this, and likely some of the lab, into your final case study